The disastrous impact of the recent hurricanes Harvey and Irma generated a large influx of data within the online community. I was curious what the history of hurricanes and tropical storms looked like, so I found a data set on data.world and started some basic exploratory data analysis (EDA).
EDA is crucial to starting any project. Through EDA you can start to identify errors and inconsistencies in your data, find interesting patterns, see correlations, and develop hypotheses to test. For most people, basic spreadsheets and charts are pretty handy and provide a great place to start. They are an easy way to manipulate and visualize your data quickly. Data scientists may cringe at the idea of using a graphical user interface (GUI) to kick off the EDA process, but the reality is clear: those tools are very effective and efficient when used properly. However, if you’re reading this, you’re probably trying to take EDA to the next level. The best way to learn is to get your hands dirty, so let’s get started.
The original source of the data can be found at DHS.gov.
| FID | YEAR | MONTH | DAY | AD_TIME | BTID | NAME | LAT | LONG | WIND_KTS | PRESSURE | CAT | BASIN | Shape_Leng |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001 | 1957 | 8 | 8 | 1800Z | 63 | NOTNAMED | 22.5 | -140.0 | 50 | 0 | TS | Eastern Pacific | 1.140175 |
| 2002 | 1961 | 10 | 3 | 1200Z | 116 | PAULINE | 22.1 | -140.2 | 45 | 0 | TS | Eastern Pacific | 1.166190 |
| 2003 | 1962 | 8 | 29 | 0600Z | 124 | C | 18.0 | -140.0 | 45 | 0 | TS | Eastern Pacific | 2.102380 |
| 2004 | 1967 | 7 | 14 | 0600Z | 168 | DENISE | 16.6 | -139.5 | 45 | 0 | TS | Eastern Pacific | 2.121320 |
| 2005 | 1972 | 8 | 16 | 1200Z | 251 | DIANA | 18.5 | -139.8 | 70 | 0 | H1 | Eastern Pacific | 1.702939 |
| 2006 | 1976 | 7 | 22 | 0000Z | 312 | DIANA | 18.6 | -139.8 | 30 | 0 | TD | Eastern Pacific | 1.600000 |
Fortunately, this is a tidy data set that appears to have been cleaned up substantially, which will make life easier. The column names are relatively straightforward, with the exception of the “ID” columns (FID and BTID).
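To make the preview above concrete, here is a minimal Python sketch (standard library only) of loading such a file and peeking at a few rows. It assumes a comma-separated file whose header matches the column names in the table. The post does not give the file name, so the sketch reads three rows copied from the preview out of an in-memory string instead. The helper `load_storm_rows` and the column groupings are my own naming, not part of the data set.

```python
import csv
import io

# Three rows copied from the preview table; in practice you would pass an
# open file handle for the downloaded CSV (its name is not given in the post).
SAMPLE_CSV = """FID,YEAR,MONTH,DAY,AD_TIME,BTID,NAME,LAT,LONG,WIND_KTS,PRESSURE,CAT,BASIN,Shape_Leng
2001,1957,8,8,1800Z,63,NOTNAMED,22.5,-140.0,50,0,TS,Eastern Pacific,1.140175
2002,1961,10,3,1200Z,116,PAULINE,22.1,-140.2,45,0,TS,Eastern Pacific,1.166190
2005,1972,8,16,1200Z,251,DIANA,18.5,-139.8,70,0,H1,Eastern Pacific,1.702939
"""

# Assumed column types, based only on how the values look in the preview.
INT_COLUMNS = ("FID", "YEAR", "MONTH", "DAY", "BTID", "WIND_KTS", "PRESSURE")
FLOAT_COLUMNS = ("LAT", "LONG", "Shape_Leng")


def load_storm_rows(handle):
    """Read storm observations from a CSV handle and convert numeric columns."""
    rows = []
    for record in csv.DictReader(handle):
        for column in INT_COLUMNS:
            record[column] = int(record[column])
        for column in FLOAT_COLUMNS:
            record[column] = float(record[column])
        rows.append(record)
    return rows


rows = load_storm_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows; columns:", ", ".join(rows[0].keys()))
for row in rows:
    print(row["YEAR"], row["NAME"], row["CAT"], row["WIND_KTS"])
```

If the real file stores any of these columns differently (for example, pressure with decimals), the type lists would need adjusting.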
The description as given by DHS.gov:
This dataset represents Historical North Atlantic and Eastern North Pacific Tropical Cyclone Tracks with 6-hourly (0000, 0600, 1200, 1800 UTC) center locations and intensities for all subtropical depressions and storms, extratropical storms, tropical lows, waves, disturbances, depressions and storms, and all hurricanes, from 1851 through 2008. These data are intended for geographic display and analysis at the national level, and for large regional areas. The data should be displayed and analyzed at scales appropriate for 1:2,000,000-scale data.
| YEAR | MONTH | DAY | WIND_KTS | PRESSURE |
|---|---|---|---|---|
| Min. :1851 | Min. : 1.000 | Min. : 1.00 | Min. : 10.00 | Min. : 0.0 |
| 1st Qu.:1928 | 1st Qu.: 8.000 | 1st Qu.: 8.00 | 1st Qu.: 35.00 | 1st Qu.: 0.0 |
| Median :1970 | Median : 9.000 | Median :16.00 | Median : 50.00 | Median : 0.0 |
| Mean :1957 | Mean : 8.541 | Mean :15.87 | Mean : 54.73 | Mean : 372.3 |
| 3rd Qu.:1991 | 3rd Qu.: 9.000 | 3rd Qu.:23.00 | 3rd Qu.: 70.00 | 3rd Qu.: 990.0 |
| Max. :2008 | Max. :12.000 | Max. :31.00 | Max. :165.00 | Max. :1024.0 |
We can confirm that this data set covers storms from 1851 to 2008, which means the data goes back roughly 100 years before storms started being named! We can also see that the minimum pressure value is 0, which likely means the pressure could not be measured (a pressure of zero is not physically possible here). The recorded months run from January to December and the days from 1 to 31. Whenever you see all of the dates laid out that way, you can smile and think to yourself, “if I need to, I can put dates in an easy-to-use format such as YYYY-mm-dd (2017-09-12)!”
Plotting the number of distinct storms per year gives a great illustration of our data set, and we can easily notice an upward trend in the number of storms over time. Before we go running to tell the world that the number of storms per year is growing, we need to drill down a bit deeper. The increase could simply be because more types of storms were recorded over time (we know the data set includes hurricanes, tropical storms, waves, etc.). However, we should keep the trend in mind when we start to develop hypotheses.
You will notice the data starts at 1950 rather than 1851. I made this choice because storms were not named until this point, so it would be difficult to count the unique storms per year before then. It could likely be done by finding a way to use the “ID” columns. However, this is a preliminary analysis, so I didn’t want to dig too deep.
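Both observations can be sketched in a few lines of Python. The block below combines the YEAR, MONTH and DAY columns into an ISO date and treats a pressure of 0 as missing, on the assumption (not confirmed by the source) that 0 means "not measured". The helper names are mine, and the third row is made up purely to show a non-zero pressure.

```python
from datetime import date


def observation_date(row):
    """Combine the YEAR, MONTH and DAY columns into a datetime.date."""
    return date(row["YEAR"], row["MONTH"], row["DAY"])


def pressure_or_none(row):
    """Return the pressure, or None when it is recorded as 0.

    Assumption: a value of 0 means the pressure was not measured.
    """
    return row["PRESSURE"] if row["PRESSURE"] > 0 else None


rows = [
    {"YEAR": 1957, "MONTH": 8, "DAY": 8, "PRESSURE": 0},   # from the preview table
    {"YEAR": 1961, "MONTH": 10, "DAY": 3, "PRESSURE": 0},  # from the preview table
    {"YEAR": 2000, "MONTH": 9, "DAY": 12, "PRESSURE": 990},  # hypothetical row
]

for row in rows:
    # isoformat() yields the YYYY-mm-dd form mentioned in the text
    print(observation_date(row).isoformat(), pressure_or_none(row))

# Averages should skip the missing values instead of counting them as 0
measured = [p for p in map(pressure_or_none, rows) if p is not None]
print("mean pressure over measured rows:", sum(measured) / len(measured))
```

Skipping the zeros matters for any average: in the summary above, the median pressure is 0 while the mean is 372.3, which suggests that a large share of the rows have no measured pressure.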
| YEAR | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975 | 1976 | 1977 | 1978 | 1979 | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988 | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 |
| Distinct_Storms | 10 | 6 | 8 | 8 | 10 | 7 | 9 | 10 | 13 | 13 | 20 | 12 | 15 | 14 | 15 | 24 | 25 | 26 | 23 | 26 | 30 | 21 | 20 | 28 | 25 | 24 | 14 | 30 | 18 | 25 | 27 | 28 | 25 | 33 | 34 | 23 | 26 | 26 | 28 | 35 | 21 | 33 | 23 | 27 | 29 | 21 | 27 | 27 | 21 | 34 | 30 | 27 | 32 | 27 | 43 | 28 | 26 | 33 |
| Distinct_Storms_Change | -3 | -4 | 2 | 0 | 2 | -3 | 2 | 1 | 3 | 0 | 7 | -8 | 3 | -1 | 1 | 9 | 1 | 1 | -3 | 3 | 4 | -9 | -1 | 8 | -3 | -1 | -10 | 16 | -12 | 7 | 2 | 1 | -3 | 8 | 1 | -11 | 3 | 0 | 2 | 7 | -14 | 12 | -10 | 4 | 2 | -8 | 6 | 0 | -6 | 13 | -4 | -3 | 5 | -5 | 16 | -15 | -2 | 7 |
| Distinct_Storms_Pct_Change | -0.23 | -0.40 | 0.33 | 0.00 | 0.25 | -0.30 | 0.29 | 0.11 | 0.30 | 0.00 | 0.54 | -0.40 | 0.25 | -0.07 | 0.07 | 0.60 | 0.04 | 0.04 | -0.12 | 0.13 | 0.15 | -0.30 | -0.05 | 0.40 | -0.11 | -0.04 | -0.42 | 1.14 | -0.40 | 0.39 | 0.08 | 0.04 | -0.11 | 0.32 | 0.03 | -0.32 | 0.13 | 0.00 | 0.08 | 0.25 | -0.40 | 0.57 | -0.30 | 0.17 | 0.07 | -0.28 | 0.29 | 0.00 | -0.22 | 0.62 | -0.12 | -0.10 | 0.19 | -0.16 | 0.59 | -0.35 | -0.07 | 0.27 |
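The three rows of this table can be derived along the following lines. This is a sketch of one plausible approach in Python, not necessarily how the numbers were produced: it assumes a "distinct storm" is a distinct NAME value within a YEAR and that every year in the range has at least one observation. The function names are illustrative.

```python
from collections import defaultdict


def distinct_storms_per_year(rows, first_year=1950):
    """Count distinct storm names per year, ignoring years before first_year."""
    names_by_year = defaultdict(set)
    for row in rows:
        if row["YEAR"] >= first_year:
            names_by_year[row["YEAR"]].add(row["NAME"])
    return {year: len(names_by_year[year]) for year in sorted(names_by_year)}


def year_over_year(counts):
    """Return (year, count, change, pct_change) for every year after the first.

    The first year is skipped because it has no previous year to compare to.
    """
    years = list(counts)
    result = []
    for previous, current in zip(years, years[1:]):
        change = counts[current] - counts[previous]
        pct_change = round(change / counts[previous], 2)
        result.append((current, counts[current], change, pct_change))
    return result


# The same storm appears once per 6-hourly observation, so names repeat.
observations = [
    {"YEAR": 1972, "NAME": "DIANA"},
    {"YEAR": 1972, "NAME": "DIANA"},     # a second observation of the same storm
    {"YEAR": 1976, "NAME": "DIANA"},
    {"YEAR": 1900, "NAME": "NOTNAMED"},  # before 1950, so it is dropped
]
print(distinct_storms_per_year(observations))

# First counts from the table; 13 for 1950 is inferred from the -3 change in 1951.
counts = {1950: 13, 1951: 10, 1952: 6, 1953: 8}
for entry in year_over_year(counts):
    print(entry)
```

One caveat of counting by name: every unnamed storm in a year shares the label NOTNAMED, so they collapse into a single "storm", which is another reason to stick to the years when storms were named.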
| Distinct_Storms | Distinct_Storms_Change | Distinct_Storms_Pct_Change |
|---|---|---|
| Min. : 6.00 | Min. :-15.0000 | Min. :-0.42000 |
| 1st Qu.:15.75 | 1st Qu.: -3.0000 | 1st Qu.:-0.12000 |
| Median :25.00 | Median : 1.0000 | Median : 0.04000 |
| Mean :22.81 | Mean : 0.3448 | Mean : 0.05966 |
| 3rd Qu.:28.00 | 3rd Qu.: 3.7500 | 3rd Qu.: 0.25000 |
| Max. :43.00 | Max. : 16.0000 | Max. : 1.14000 |
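As a cross-check, the Distinct_Storms column of this summary can be recomputed from the yearly counts listed earlier. The sketch below uses Python's `statistics` module, with the yearly values copied by hand from the table. I use the `"inclusive"` quantile method because it interpolates between the sorted values in a way that reproduces the quartiles shown here; the function name `summary_stats` is my own.

```python
import statistics

# Distinct_Storms for 1951-2008, copied from the yearly table
distinct_storms = [
    10, 6, 8, 8, 10, 7, 9, 10, 13, 13,
    20, 12, 15, 14, 15, 24, 25, 26, 23, 26,
    30, 21, 20, 28, 25, 24, 14, 30, 18, 25,
    27, 28, 25, 33, 34, 23, 26, 26, 28, 35,
    21, 33, 23, 27, 29, 21, 27, 27, 21, 34,
    30, 27, 32, 27, 43, 28, 26, 33,
]


def summary_stats(values):
    """Return min, quartiles, mean and max, similar to the summary table."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": round(statistics.mean(values), 2),
        "q3": q3,
        "max": max(values),
    }


print(summary_stats(distinct_storms))
```

For these 58 values the result agrees with the table: minimum 6, first quartile 15.75, median 25, mean 22.81, third quartile 28 and maximum 43.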
A tibble: 257 x 4, grouped by CAT (5 groups). The first 10 rows:

| CAT | YEAR | Distinct_Storms | Distinct_Storms_Pct_Change |
|---|---|---|---|
| H1 | 1950 | 12 | NA |
| H1 | 1951 | 8 | -0.33 |
| H1 | 1952 | 6 | -0.25 |
| H1 | 1953 | 6 | 0.00 |
| H1 | 1954 | 6 | 0.00 |
| H1 | 1955 | 9 | 0.50 |
| H1 | 1956 | 4 | -0.56 |
| H1 | 1957 | 6 | 0.50 |
| H1 | 1958 | 7 | 0.17 |
| H1 | 1959 | 8 | 0.14 |

... with 247 more rows